Targeted Data Poisoning for Black-Box Audio Datasets Ownership Verification
Bouaziz, Wassim, El-Mhamdi, El-Mahdi, Usunier, Nicolas
Protecting the use of audio datasets is a major concern for data owners, particularly with the recent rise of audio deep learning models. While watermarks can be used to protect the data itself, they do not make it possible to identify a deep learning model trained on a protected dataset. In this paper, we adapt the recently introduced data taggants approach to audio data. Data taggants is a method to verify whether a neural network was trained on a protected image dataset using only top-$k$ prediction access to the model. The method relies on a targeted data poisoning scheme that discreetly alters a small fraction (1%) of the dataset so as to induce a harmless behavior on out-of-distribution data called keys. We evaluate our method on the SpeechCommands and ESC50 datasets with state-of-the-art transformer models, and show that we can detect the use of the dataset with high confidence and without loss of performance. We also show the robustness of our method against common data augmentation techniques, making it a practical way to protect audio datasets.
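The verification step described above reduces to a hypothesis test: query the suspect model on the keys and ask how unlikely the observed number of top-$k$ hits would be for a model that never saw the taggants. A minimal sketch of such a test follows; the function name and the chance-level null (each key's target label lands in a random top-$k$ with probability $k/C$) are illustrative assumptions, not the paper's exact statistic.

```python
from math import comb

def taggant_pvalue(topk_preds, key_labels, num_classes, k):
    """One-sided binomial test for dataset-use detection.

    topk_preds: per-key list of the model's top-k predicted labels.
    key_labels: the target label planted for each key.
    Null hypothesis: an untainted model hits a key's target label by
    chance with probability k / num_classes.
    Returns (p-value of observing at least this many hits, hit count).
    """
    hits = sum(label in preds for preds, label in zip(topk_preds, key_labels))
    n = len(key_labels)
    p0 = k / num_classes
    # Upper binomial tail: P[X >= hits] for X ~ Binomial(n, p0)
    pval = sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(hits, n + 1))
    return pval, hits
```

A small p-value lets the owner claim, with quantified confidence, that the model was trained on the protected dataset.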
Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation
Wardah, Wafaa, Büyüktaş, Tuğçe Melike Koçak, Shchegelskiy, Kirill, Möller, Sebastian, Spang, Robert P.
Objective speech quality models aim to predict human-perceived speech quality using automated methods. However, cross-lingual generalization remains a major challenge, as Mean Opinion Scores (MOS) vary across languages due to linguistic, perceptual, and dataset-specific differences. A model trained primarily on English data may struggle to generalize to languages with different phonetic, tonal, and prosodic characteristics, leading to inconsistencies in objective assessments. This study investigates the cross-lingual performance of two speech quality models: NISQA, a CNN-based model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both models were trained exclusively on English datasets containing over 49,000 speech samples and subsequently evaluated on speech in German, French, Mandarin, Swedish, and Dutch. We analyze model performance using Pearson Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS. Our findings show that while AST achieves a more stable cross-lingual performance, both models exhibit noticeable biases. Notably, Mandarin speech quality predictions correlate highly with human MOS scores, whereas Swedish and Dutch present greater prediction challenges. Discontinuities remain difficult to model across all languages. These results highlight the need for more balanced multilingual datasets and architecture-specific adaptations to improve cross-lingual generalization.
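The two evaluation metrics used above are standard and easy to reproduce. As a reference, a minimal implementation of PCC and RMSE over a batch of predicted and subjective scores (any of the five quality dimensions) might look like this:

```python
import numpy as np

def pcc(pred, mos):
    """Pearson correlation between predicted and subjective scores."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    return float(np.corrcoef(pred, mos)[0, 1])

def rmse(pred, mos):
    """Root mean square error of the predictions."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    return float(np.sqrt(np.mean((pred - mos) ** 2)))
```

Note that PCC is invariant to affine rescaling of the predictions while RMSE is not, which is why cross-lingual studies typically report both: a model can rank stimuli correctly in a new language (high PCC) while still being biased in absolute MOS terms (high RMSE).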
Model and Deep learning based Dynamic Range Compression Inversion
Sun, Haoran, Fourer, Dominique, Maaref, Hichem
Dynamic Range Compression (DRC) is a fundamental process in audio signal processing that aims at changing the dynamic range of a signal. The technique is widely used at various stages of audio production, such as recording, mixing, and mastering, to control the loudness of an audio signal and prevent clipping or distortion [1]. However, applying DRC often alters the audio's timbre and perceived quality, making its inversion a challenging task. Inverting DRC is thus of great interest in the context of audio reverse engineering [2], since it aims at recovering the original dynamic range and audio quality of a signal, with applications such as signal restoration, remixing, and enhanced creative control. DRC inversion is a challenging problem that typically requires side information, an explicit DRC model, and prior knowledge of the DRC parameters to be processed efficiently. Only a few studies directly address the problem of DRC inversion. In [3], the authors cast DRC inversion as a rate-distortion optimization problem using a coder-decoder framework that minimizes both the side information and the reconstruction error when combined with a specific estimator applied to the compressed signal. In [4], the authors propose a specific DRC model that yields a promising reconstruction approximation but requires exact knowledge of the DRC parameters of the compressed signal.
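To make the forward problem concrete, the sketch below implements the textbook memoryless compressor that inversion methods take as a starting point: a hard-knee static gain curve with a threshold and a ratio, applied sample by sample in the dB domain. This is a deliberately simplified illustration (no attack/release smoothing, no make-up gain), not the specific DRC model of [3] or [4].

```python
import numpy as np

def drc_static_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Hard-knee static compression curve: levels above the threshold
    are attenuated by the ratio. Returns the gain to apply, in dB
    (negative above the threshold, zero below)."""
    level_db = np.asarray(level_db, float)
    over = np.maximum(level_db - threshold_db, 0.0)
    return over * (1.0 / ratio - 1.0)

def compress(x, threshold_db=-20.0, ratio=4.0, eps=1e-12):
    """Apply memoryless DRC to a signal (no attack/release envelope)."""
    x = np.asarray(x, float)
    level_db = 20.0 * np.log10(np.abs(x) + eps)
    gain = 10.0 ** (drc_static_gain_db(level_db, threshold_db, ratio) / 20.0)
    return x * gain
```

Even in this toy setting, inversion requires knowing the threshold and ratio: below the threshold the map is the identity, while above it the inverse gain must re-expand the level by the same ratio, which is exactly the parameter knowledge requirement the abstract points out for [4].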
Coupling Speech Encoders with Downstream Text Models
Chelba, Ciprian, Schalkwyk, Johan
Automatic speech translation (AST) modeling is usually plagued by a lack of parallel training data, which limits the success of end-to-end models. Owing to their modular architecture, cascade models for AST have the advantage of leveraging the large amounts of data available for building automatic speech recognition (ASR) and machine translation (MT) models, respectively. The straightforward way of building a cascade AST model is to send the 1-best ASR transcription to the text MT model. Yet another advantage of such an architecture is that it is in fact multi-modal and multi-task: besides speech, it also accepts text input for translation, and it produces ASR output either in stand-alone mode or as a side product of the AST task. This multi-input, multi-modal view of the AST task is firmly anchored in the reality of practical applications, so we take it as a fundamental design choice: we aim to build a model that delivers both state-of-the-art ASR and MT performance, while optimizing AST performance within these constraints. Translating ASR 1-best output has the obvious disadvantage that any further training (fine-tuning) on AST parallel data specific to a given domain cannot back-propagate the cross-entropy loss gradient through the interface between the ASR and MT models. For tighter coupling between the ASR and MT modules, we follow the approach of Dalmia et al. (2021), which leverages the 1-best ASR alignment and sends the ASR encoder embeddings aligned with the 1-best ASR sequence to the MT model. This results in a cascade architecture that allows the back-propagated gradient to flow from the MT model into the ASR components. The ASR model in our work uses a conformer encoder architecture (Gulati et al., 2020), pre-trained on a large amount of speech data as described in the Unified Speech Model (USM) work (Zhang et al., 2023).
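The coupling step can be illustrated with a small sketch: given the ASR encoder's frame-level outputs and the 1-best alignment mapping each output token to a frame span, select (here: average) the aligned encoder embeddings and hand that shorter sequence to the MT model. The function name and the span-averaging choice are illustrative assumptions; the cited approach may select frames differently.

```python
import numpy as np

def select_aligned_embeddings(encoder_out, alignment):
    """For each 1-best ASR token, pool the encoder frames it aligns to.

    encoder_out: (T, D) array of per-frame encoder embeddings.
    alignment:   list of (start_frame, end_frame) per token, end exclusive.
    Returns a (num_tokens, D) array fed to the MT model in place of text
    embeddings, keeping the interface differentiable end to end.
    """
    return np.stack([encoder_out[s:e].mean(axis=0) for s, e in alignment])
```

Because the selected vectors are a differentiable function of the encoder outputs (unlike a discrete 1-best transcript), fine-tuning on AST parallel data can now propagate the MT loss gradient back into the ASR encoder.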
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Feng, Jiu, Erol, Mehmet Hamza, Chung, Joon Son, Senocak, Arda
Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformer (AST), also inherit the fixed-size input paradigm from CNNs. However, this leads to performance degradation for ASTs at inference when input lengths differ from those seen during training. This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference. By employing sequence packing, our method, ElasticAST, accommodates any audio length during training, thereby offering flexibility across all lengths and resolutions at inference. This flexibility allows ElasticAST to maintain evaluation capabilities at various lengths or resolutions and to achieve performance similar to standard ASTs trained at specific lengths or resolutions. Moreover, experiments demonstrate ElasticAST's better performance when trained and evaluated on native-length audio datasets.
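The sequence-packing idea can be sketched independently of the model: concatenate variable-length patch sequences into fixed-size rows and keep a sample-id mask so that attention can later be restricted to patches of the same clip. The greedy first-fit strategy and the `-1` padding convention below are illustrative assumptions, not necessarily ElasticAST's exact packing algorithm; it is assumed each clip fits in one row.

```python
import numpy as np

def pack_sequences(specs, max_len):
    """Greedily pack variable-length patch sequences into rows of max_len.

    specs: list of per-clip patch sequences, each a list of D-dim patches.
    Returns (packed, mask): packed is (rows, max_len, D); mask holds the
    originating sample id per slot, with -1 marking padding, so a
    block-diagonal attention mask can be derived from it.
    """
    cur, cur_ids, out_rows, out_ids = [], [], [], []
    for sid, seq in enumerate(specs):
        if cur and len(cur) + len(seq) > max_len:
            out_rows.append(cur); out_ids.append(cur_ids)  # row is full
            cur, cur_ids = [], []
        cur.extend(seq)
        cur_ids.extend([sid] * len(seq))
    out_rows.append(cur); out_ids.append(cur_ids)
    dim = len(specs[0][0])
    packed = np.zeros((len(out_rows), max_len, dim))
    mask = -np.ones((len(out_rows), max_len), dtype=int)
    for i, (row, ids) in enumerate(zip(out_rows, out_ids)):
        packed[i, : len(row)] = row
        mask[i, : len(ids)] = ids
    return packed, mask
```

Packing keeps GPU batches dense without padding every clip to the longest one, which is what makes training on native-length audio affordable.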
Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders
Lau, Hok-Shing, Huntly, Mark, Morgan, Nathon, Iyenoma, Adesua, Zeng, Biao, Bashford, Tim
Speech contains information that is clinically relevant to some diseases and thus has the potential to be used for health assessment. Recent work shows interest in applying deep learning algorithms, especially pretrained large speech models, to Automatic Speech Assessment. One question that has not been explored is how these models arrive at their outputs from their inputs. In this work, we train and compare two configurations of the Audio Spectrogram Transformer in the context of voice disorder detection and apply the attention rollout method to produce model relevance maps, the computed relevance of the spectrogram regions when the model makes predictions. We use these maps to analyse how the models make predictions under different conditions and to show that the spread of attention is reduced as a model is fine-tuned, with the model's attention concentrating on specific phoneme regions.
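Attention rollout itself is a simple post-hoc computation over the stored attention matrices: average the heads of each layer, add the identity to account for the residual connection, renormalise, and multiply the layer matrices to trace attention from the output back to the input patches. A minimal numpy version:

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout (Abnar & Zuidema, 2020).

    attentions: one (heads, tokens, tokens) array per transformer layer,
    ordered from the first layer to the last. Returns a (tokens, tokens)
    matrix whose row for the class token gives a relevance map over the
    input spectrogram patches.
    """
    rollout = None
    for layer_att in attentions:
        a = layer_att.mean(axis=0)          # fuse attention heads
        a = a + np.eye(a.shape[0])          # residual connection
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a if rollout is None else a @ rollout
    return rollout
```

Reshaping the class-token row back onto the time-frequency patch grid yields the spectrogram relevance maps used in the analysis above.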
Adversarial Sparse Teacher: Defense Against Distillation-Based Model Stealing Attacks Using Adversarial Examples
Yilmaz, Eda, Keles, Hacer Yalim
Knowledge Distillation (KD) facilitates the transfer of discriminative capabilities from an advanced teacher model to a simpler student model, enhancing performance without compromising accuracy. It can also be exploited for model stealing attacks, where adversaries use KD to mimic the functionality of a teacher model. Recent developments in this domain have been influenced by the Stingy Teacher model, whose empirical analysis showed that sparse outputs can significantly degrade the performance of student models. To address the risk of intellectual property leakage, our work introduces an approach for training a teacher model that inherently protects its logits, influenced by the Nasty Teacher concept. Differing from existing methods, we incorporate sparse outputs of adversarial examples alongside standard training data to strengthen the teacher's defense against student distillation. Our approach carefully reduces the relative entropy between the original and adversarially perturbed outputs, allowing the model to produce adversarial logits with minimal impact on overall performance. The source code will be made publicly available soon.
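The two ingredients named above, sparse (top-$k$) outputs and the relative entropy (KL divergence) between output distributions, are easy to state concretely. The sketch below shows both building blocks; the function names and the specific sparsification (masking non-top-$k$ logits to a large negative value) are illustrative assumptions, not the paper's exact training objective.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    """Relative entropy KL(p || q) between two probability vectors."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def sparse_topk_logits(z, k):
    """Keep the k largest logits and mask the rest to a large negative
    value, producing the kind of sparse teacher output the abstract
    describes: the resulting softmax is concentrated on k classes."""
    out = np.full_like(z, -1e9)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out
```

A defense in this spirit would add a KL term between the teacher's clean and adversarially perturbed output distributions to the training loss, so the published logits become misleading for a distilling student while the argmax prediction, and hence accuracy, is largely preserved.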
Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification
Bae, Sangmin, Kim, June-Woo, Cho, Won-Yang, Baek, Hyerim, Son, Soyoun, Lee, Byungjo, Ha, Changwan, Tae, Kyongpil, Kim, Sungnyun, Yun, Se-Young
Respiratory sound contains crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been a growing interest in contact-free medical care based on electronic stethoscopes. To this end, cutting-edge deep learning models have been developed to diagnose lung diseases; however, the task remains challenging due to the scarcity of medical data. In this study, we demonstrate that a model pretrained on large-scale visual and audio datasets can be generalized to the respiratory sound classification task. In addition, we introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with the Audio Spectrogram Transformer (AST). We further propose a novel and effective Patch-Mix Contrastive Learning to distinguish the mixed representations in the latent space. Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by an improvement of 4.08%.
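The Patch-Mix operation described above is simple to sketch at the tensor level: pair each sample with a random partner in the batch and replace a random subset of its patch embeddings with the partner's. The sketch below is a minimal illustration under that reading; the per-patch Bernoulli mask and the single random pairing are assumptions, not necessarily the paper's exact sampling scheme.

```python
import numpy as np

def patch_mix(batch, mix_ratio=0.5, rng=None):
    """Randomly mix patches between samples in a batch.

    batch: (B, N, D) array of per-sample patch embeddings.
    mix_ratio: probability that a patch is taken from the partner sample.
    Returns (mixed batch, partner permutation, per-patch mix mask); the
    permutation and mask are what a mixed-label or contrastive loss needs.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    B, N, _ = batch.shape
    perm = rng.permutation(B)                  # partner for each sample
    mask = rng.random((B, N)) < mix_ratio      # True = take partner's patch
    mixed = np.where(mask[..., None], batch[perm], batch)
    return mixed, perm, mask
```

The returned mask is exactly the mixing information a contrastive objective can use to decide which latent representations should be pulled together or pushed apart.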
Multi-Step Dialogue Workflow Action Prediction
Ramakrishnan, Ramya, Elenberg, Ethan, Narangodage, Hashan, McDonald, Ryan
In task-oriented dialogue, a system often needs to follow a sequence of actions, called a workflow, that complies with a set of guidelines in order to complete a task. In this paper, we propose the novel problem of multi-step workflow action prediction, in which the system predicts multiple future workflow actions. Accurate prediction of multiple steps allows for multi-turn automation, which can free up time to focus on more complex tasks. We propose three modeling approaches that are simple to implement yet lead to more action automation: 1) fine-tuning on a training dataset, 2) few-shot in-context learning leveraging retrieval and large language model prompting, and 3) zero-shot action graph traversal, which aggregates historical action sequences into a graph for prediction. We show that multi-step action prediction produces features that improve accuracy on downstream tasks.
[Figure 1: We propose the problem of multi-step Action State Tracking (AST), which involves predicting many future workflow actions while prior work only predicts one step. We represent predictions as graphs that capture potential branching in future action sequences.]
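Of the three approaches, the zero-shot action graph traversal is the easiest to make concrete: count transitions between consecutive actions across historical workflows, then walk the most frequent outgoing edges for as many steps as needed. The greedy traversal below is an illustrative simplification; the paper's graphs also capture branching, which a greedy walk collapses, and the action names in the test are hypothetical.

```python
from collections import Counter, defaultdict

def build_action_graph(histories):
    """Aggregate historical workflow action sequences into a transition
    graph: for each action, a Counter of its observed successors."""
    graph = defaultdict(Counter)
    for seq in histories:
        for a, b in zip(seq, seq[1:]):
            graph[a][b] += 1
    return graph

def predict_next_actions(graph, current, steps):
    """Zero-shot multi-step prediction: greedily follow the most
    frequent outgoing edge for the requested number of steps."""
    preds = []
    for _ in range(steps):
        if current not in graph or not graph[current]:
            break  # no observed successor: stop early
        current = graph[current].most_common(1)[0][0]
        preds.append(current)
    return preds
```

Keeping the full successor Counters, rather than only the argmax, is what lets the representation express the branching future action sequences mentioned in the figure.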
FlexiAST: Flexibility is What AST Needs
Feng, Jiu, Erol, Mehmet Hamza, Chung, Joon Son, Senocak, Arda
The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent advances in ASTs have shown superior performance on various audio-based tasks. However, the performance of standard ASTs degrades drastically when they are evaluated with patch sizes different from those used during training. As a result, AST models are typically re-trained to accommodate changes in patch size. To overcome this limitation, this paper proposes a training procedure, FlexiAST, that provides flexibility to standard AST models without architectural changes, allowing them to work with various patch sizes at the inference stage. The proposed training approach simply utilizes random patch-size selection and resizing of the patch and positional embedding weights. Our experiments show that FlexiAST gives performance similar to standard AST models while maintaining its evaluation ability at various patch sizes on different datasets for audio classification tasks.
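The weight-resizing step can be illustrated with plain bilinear resampling of the patch-embedding kernel: treat the trained kernel as a small image over the patch grid and interpolate it to the new patch size. This is a deliberately simple sketch; it is an assumption that bilinear interpolation stands in for whatever resizing operator FlexiAST actually uses, and the function name is hypothetical.

```python
import numpy as np

def resize_patch_embed(weight, new_size):
    """Bilinearly resample a patch-embedding kernel (P, P, D) to a new
    spatial size (new_size, new_size, D), so a model trained with one
    patch size can consume spectrogram patches of another size."""
    P, _, D = weight.shape
    xs = np.linspace(0.0, P - 1, new_size)  # sample positions in old grid
    out = np.empty((new_size, new_size, D))
    for i, yi in enumerate(xs):
        for j, xj in enumerate(xs):
            y0, x0 = int(np.floor(yi)), int(np.floor(xj))
            y1, x1 = min(y0 + 1, P - 1), min(x0 + 1, P - 1)
            wy, wx = yi - y0, xj - x0
            out[i, j] = ((1 - wy) * (1 - wx) * weight[y0, x0]
                         + (1 - wy) * wx * weight[y0, x1]
                         + wy * (1 - wx) * weight[y1, x0]
                         + wy * wx * weight[y1, x1])
    return out
```

The same interpolation idea applies to the positional embeddings, whose grid length changes whenever the patch size (and hence the number of patches) changes.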